Session 2 | 19/08/2024 Henrique Costa | Métodos Estratégicos em FinQuant
Data and R
Functions
Recipes let chefs prepare tasty treats
Recipes call for ingredients
Recipes involve one or more steps
The steps transform the ingredients into treats
Functions are like customizable recipes
Functions ask for inputs (“arguments”)
Functions involve one or more lines of code
The code transforms inputs into outputs
Using functions requires parentheses (usually)
doce <- f(ing1, ing2)
Functions - Practice
# USECASE: A function can perform a task in an easier, more readable way
# TEMPLATE: output <- function_name(input)
9^(1/2)
x <- sqrt(9)
x
# ==============================================================================
# LESSON: We can also use functions to transform objects
y <- 9
sqrt(y)
# ==============================================================================
# LESSON: We can even use functions to transform the results of calculations
2/3
round(2/3)
# ==============================================================================
# LESSON: We can customize what a function does using arguments
# TEMPLATE: output <- function_name(argument, argument_name = argument_value)
round(2/3, digits = 2)
round(2/3, digits = 3)
# ==============================================================================
# LESSON: Some arguments are optional because they have default values
round(2/3) # the default value for digits is 0
round(2/3, digits = 0)
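The recipe analogy can be made concrete by writing a small function of our own. This is a minimal sketch rather than part of the practice script: the name `round_ratio` and its arguments `a`, `b`, and `digits` are hypothetical, chosen only to show inputs, a body, an output, and an optional argument with a default value.

```r
# Hypothetical helper (not from the slides): divide a by b, then round
round_ratio <- function(a, b, digits = 0) {
  ratio <- a / b                 # the body turns the inputs into an output
  round(ratio, digits = digits)  # the last value computed is returned
}

round_ratio(2, 3)              # digits falls back to its default of 0
round_ratio(2, 3, digits = 2)  # the default is overridden
```

Giving `digits` a default mirrors how `round()` itself behaves in the practice code above.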
Vectors
Vectors combine similar objects into a collection
I like to imagine a train pulling several cars
A vector is an object with many sub-objects
We refer to each sub-object as an element
Some functions transform each element one at a time
Double the amount of cargo in each car
Some functions summarize across all elements
Calculate the total cargo across all cars in the train
v <- c(1, 2, 3, 4, 5)
Vectors - Practice
# LESSON: We can combine multiple elements into a vector
# TEMPLATE: vector_name <- c(element1, element2, element3)
x <- 4 9 16 25 # error
x <- c(4, 9, 16, 25)
x
y <- c(2, 3)
y
# ==============================================================================
# LESSON: We can also combine multiple vectors and elements
c(x, y)
c(x, y, 20)
# ==============================================================================
# USECASE: Math operators will transform each element individually
x + 1
x * 3
x # but, again, this will not be saved unless you use assignment
# ==============================================================================
# USECASE: Some functions will also transform each element individually
sqrt(x)
log(x)
# ==============================================================================
# USECASE: Other functions will summarize the vector with a single number
length(x)
sum(x)
mean(x)
Strings
When programming with R, we need a way to distinguish
Object/function names (e.g., the function mean)
Text/character data (e.g., the word mean)
Strings are R’s way of storing text data
Strings can store any characters (no rules!)
Strings are created and displayed with quotes
Strings
R has great tools for working with strings
Strings can be collected into vectors
Special functions can transform strings
name <- "John Doe"
Strings - Practice
# USECASE: Strings are the main way to store character data in R
my_color <- red # error
my_color <- "red" # correct
# ==============================================================================
# USECASE: Strings can also store symbols not allowed in object names
dye <- "red#40"
dye
dyes <- c("red#40", "blue#02")
dyes
# ==============================================================================
# PITFALL: Many operations you can do with numbers will not work for strings
dyes + 1 # error
mean(dyes) # warning, returns NA
# ==============================================================================
# USECASE: But other operations work for both or even only for strings
length(dyes)
nchar(dyes)
dyes2 <- toupper(dyes)
dyes2
Packages
Cookbooks are a great way to learn to cook
They contain many recipes and instructions
Browse an online bookstore to find a cookbook
Order it to add to your personal bookshelf
To use it, take the cookbook off the shelf
Packages
Packages are like cookbooks for R
They contain useful functions and datasets
Browse an online repository for a package
Install it to add it to your personal library
To use it, load the package from the library
library("pkg_name")
Packages - Practice
# USECASE: The stringr package adds a function to fix capitalization
students <- c("mary anne", "BENjamin", "Lee")
# ==============================================================================
# PITFALL: But we cannot use that function without installing the package
str_to_title(students) # error
# ==============================================================================
# LESSON: Installing a package using RStudio
# - RStudio > Extras pane > Packages tab > Install button
# ==============================================================================
# PITFALL: We also need to load the package before we can use it
str_to_title(students) # error
# ==============================================================================
# LESSON: We load the package using library()
library("stringr")
str_to_title(students) # finally works!
# ==============================================================================
# LESSON: We can also keep our packages up to date using RStudio
# RStudio > Extras pane > Packages tab > Update button
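The same install-then-load workflow can also be driven from code instead of the RStudio buttons. A minimal sketch, assuming the stringr package is already installed (the install line is left commented out so nothing is downloaded by accident); it also shows `pkg::function()` as an alternative to `library()`.

```r
# install.packages("stringr")  # run once to add stringr to your library

students <- c("mary anne", "BENjamin", "Lee")

# requireNamespace() reports whether a package is installed, without attaching it
stringr_installed <- requireNamespace("stringr", quietly = TRUE)
stringr_installed

# pkg::fun() calls a single function without calling library() first
fixed <- stringr::str_to_title(students)
fixed
```

The `::` form is handy when only one function from a package is needed.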
Wrangle I
Tidy Data Principles
There are many ways to store data
We will learn the tidy data format
Data should be rectangular
Each variable has its own column
Each observation has its own row
Each value has its own cell
Other Data Advice
Name all variables in the first row
This is called a header row
Avoid merged cells for data storage
These are okay for communication
Avoid empty cells whenever possible
Mark missing data as NA
Avoid formatting-as-data for storage
e.g., non-redundant color-coding
Tidying Example 1
Not Tidy
Name    Ann   Bob   Cat   Dom
Age     13    10    11    11
Weight  56.4  46.8  41.3  43.3
❌ Here, each row is a variable and each column is an observation.
Tidy
Name  Age  Weight
Ann   13   56.4
Bob   10   46.8
Cat   11   41.3
Dom   11   43.3
✔️ Here, each column is a variable and each row is an observation.
Tidying Example 2
Not Tidy
Names:  Ann     Bob  Cat  Dom
Age     Weight
13      56.4
10      46.8
11      41.3
11      43.3
❌ Here, we have data that is not rectangular because the Names variable has its own row.
Tidy
Name  Age  Weight
Ann   13   56.4
Bob   10   46.8
Cat   11   41.3
Dom   11   43.3
✔️ Here, we have made the data rectangular by moving the Names variable to its own column.
Tidying Example 3
Not Tidy
country      year  cases / population
Afghanistan  1999  NA / 19987071
             2000  2666 / 20595360
Brazil       1999  37737 / 172006362
             2000  80488 / 174504898
China        1999  212258 / 1272915272
             2000  213766 / 1280428583
❌ Here, we have merged cells and two values stored in a single cell.
Tidy
country      year  cases   population
Afghanistan  1999  NA      19987071
Afghanistan  2000  2666    20595360
Brazil       1999  37737   172006362
Brazil       2000  80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583
✔️ Here, we have un-merged the countries and separated the cases and populations variables into columns.
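The fix in Example 3 can be sketched in code. This is one possible approach, not necessarily how the slide was produced: it assumes the untidy table was read in with the blank (merged) country cells stored as NA and the combined column named `rate` (both assumptions), and it uses `fill()` and `separate()` from the tidyr package.

```r
library(tibble)
library(tidyr)

# Untidy input: merged country cells became NA, and two values share one cell
untidy <- tibble(
  country = c("Afghanistan", NA, "Brazil", NA, "China", NA),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate    = c("NA / 19987071", "2666 / 20595360",
              "37737 / 172006362", "80488 / 174504898",
              "212258 / 1272915272", "213766 / 1280428583")
)

tidy <- untidy |>
  fill(country) |>                       # copy each country down into the gaps
  separate(rate, into = c("cases", "population"),
           sep = " / ", convert = TRUE)  # split the cell and convert to numbers
tidy
```

With `convert = TRUE`, the text "NA" in the first row becomes a true missing value.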
Tidying Example 4
Not Tidy
student   grade
Amber     91.5   A-
Bristol   86.2   B
Charlene  94.0   A
Diego     89.3   B+
Legend: Psych. Major, Psych. Minor
❌ Here, we have a missing variable name and formatting-as-data.
Tidy
student   psych  grade  letter
Amber     major  91.5   A-
Bristol   minor  86.2   B
Charlene  major  94.0   A
Diego     NA     89.3   B+
✔️ Here, we have added a column for the psych variable, removed the legend, and named the letter variable.
Tidying Example 5
Not Tidy
student        grade  letter
Amber          91.5   A-
Bristol*       94.2   A
Class Summary
As             2      Yay!
Bs             0
*Grade was revised.
❌ Here, we have two types of data in one file and a footnote as data.
Tidy
student  grade  letter  revised
Amber    91.5   A-      FALSE
Bristol  94.2   A       TRUE

letter  count  notes
A       2      Yay!
B       0
✔️ Here, we have split the data into two separate tables and added the revised and notes variables.
Long vs. Wide Format
Wide Format
date        Boeing   Amazon   Google
2009-01-01  $173.55  $174.90  $174.34
2009-01-02  $172.61  $171.42  $170.04
✔️ Here, we have a wide format where each observation is a date.
Long Format
date        stock   price
2009-01-01  Boeing  $173.55
2009-01-01  Amazon  $174.90
2009-01-01  Google  $174.34
2009-01-02  Boeing  $172.61
2009-01-02  Amazon  $171.42
2009-01-02  Google  $170.04
✔️ Here, we have a long format where each observation is the combination of a date and a stock.
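Converting between the two layouts can be sketched with `pivot_longer()` from the tidyr package. The tibble below retypes the wide table, with prices stored as plain numbers rather than dollar-formatted text (an assumption made here so the values stay numeric).

```r
library(tibble)
library(tidyr)

wide <- tibble(
  date   = c("2009-01-01", "2009-01-02"),
  Boeing = c(173.55, 172.61),
  Amazon = c(174.90, 171.42),
  Google = c(174.34, 170.04)
)

# Stack the three stock columns into stock/price pairs
long <- wide |>
  pivot_longer(cols = c(Boeing, Amazon, Google),
               names_to = "stock", values_to = "price")
long
```

Each row of `long` is now one date-stock combination, as in the long-format table.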
Tibbles
R works particularly well with tidy data
We store tidy data in data frames or tibbles
Tibbles are just fancier data frames (i.e., they have a few extra features)
To use tibbles, we need the tidyverse package
Tibbles are constructed from one or more vectors
The vectors must have the same length
They can contain different types of data
Vectors
We start with three separate vector objects that all have the same length.
We set it up so that the \(n\)-th car in each train corresponds to the same observation.
Tibble
Then we combine the vectors into a single tibble (or data frame) object.
Now, as the tibble moves around, the variables always stay together.
Tibbles Live Coding
# SETUP: Install and load the tidyverse package
# Extras pane > Packages tab > Install
library(tidyverse)
# ==============================================================================
# LESSON: Create a tibble from vectors
x <- c(10, 20, 30, 40)
x
y <- x * 2 - 4
y
my_tibble <- tibble(x, y)
my_tibble
# ==============================================================================
# USECASE: You can mix different types of vectors in a single tibble
first_names <- c("Adam", "Billy", "Caitlyn", "Debra")
age_years <- c(12, 13, 10, NA)
guests <- tibble(first_names, age_years)
guests
# ==============================================================================
# TIP: To save time, you can also create the vectors in the tibble call
gradebook <- tibble(
  grade = c(95, 83, 90, 76),
  letter = c("a", "b", "a-", "c")
)
gradebook
# ==============================================================================
# PITFALL: Don't try to combine vectors with different lengths
y <- c(1, 2, 3)
x <- c("a", "b")
tibble(y, x) # error
# ==============================================================================
# LESSON: However, the exception is R will "recycle" a single value
tibble(y, x = "a")
# ==============================================================================
# LESSON: You can "extract" a vector from a tibble using $
mytibble <- tibble(x = c(1, 2, 3, 4, 5), y = "test")
mytibble$x
mytibble$y
# ==============================================================================
# PITFALL: Don't try to extract a vector that doesn't exist
mytibble$z # warning, returns NULL
Importing and Exporting
Data is usually stored in data files
Importing files into R is called reading
Exporting files from R is called writing
A convenient data file type is a CSV
This stands for comma-separated values
A CSV file is easy to share with other people
The tidyverse package can read/write CSVs
Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)
Read/Write Live Coding
# SETUP: Load the tidyverse package (if you haven't yet)
library(tidyverse)
# ==============================================================================
# USECASE: Create a tibble and write it to a file
gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)
gradebook
write_csv(gradebook, file = "gradebook.csv")
# NOTE: You can see the new file in Extras pane > Files tab.
# You can open the file in another program (e.g., Microsoft Excel).
# You can also email this file to someone else to share it.
# ==============================================================================
# PITFALL: Don't swap the order of the tibble and the file
write_csv("gradebook.csv", gradebook) # error
# ==============================================================================
# USECASE: Read in a file containing data
old_gradebook <- read_csv("gradebook.csv")
old_gradebook
# NOTE: read_csv() will examine and guess the data type of each variable.
# You can tell it the data type of each variable, but that is more advanced.
# ==============================================================================
# PITFALL: Don't use the read.csv() and write.csv() functions
old_gradebook <- read.csv("gradebook.csv") # not a tibble
old_gradebook
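The note above mentions that `read_csv()` can be told the data type of each variable. A minimal sketch of that option, assuming the readr package (loaded as part of tidyverse) and using a temporary file so that nothing in the working directory is overwritten:

```r
library(tibble)
library(readr)

gradebook <- tibble(id = c(123, 456, 789), grade = c("A", "B", "A"))

path <- tempfile(fileext = ".csv")  # throwaway file for this sketch
write_csv(gradebook, file = path)

# Without col_types, id would be guessed as a number; here we force text
typed <- read_csv(path, col_types = cols(id = col_character(),
                                          grade = col_character()))
typed
```

Storing identifiers as text is a common choice because arithmetic on an ID is rarely meaningful.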
Wrangle II
Basic wrangling verbs
tidyverse provides tools for wrangling tibbles
These functions are named after verbs
So if you name your objects after nouns…
…your code becomes easier to read
Noun(noun) ❌        Verb(noun) ✔️
blender(fruit)      blend(fruit)
screwdriver(screw)  drive(screw)
boxcutter(box)      cut(box)
Column-focused verbs
Select retains only certain columns/variables
select(TBL, VAR_KEEP, -VAR_DROP)
Mutate adds or transforms columns/variables
mutate(TBL, NEW_VAR = OLD_VAR / 1000)
Rename changes the names of columns/variables
rename(TBL, NEW_NAME = OLD_NAME)
Relocate changes the order of columns/variables
relocate(TBL, VAR_MOVE, .after = OTHER_VAR)
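Of the four column-focused verbs, `mutate()` is the one that computes new values. A minimal sketch, assuming the `starwars` tibble that ships with dplyr, where `height` is recorded in centimeters; the new column name `height_m` is only illustrative.

```r
library(dplyr)

# Add a new column computed from an existing one
sw <- mutate(starwars, height_m = height / 100)
select(sw, name, height, height_m)

# Reusing an existing name overwrites that column instead of adding one
sw2 <- mutate(starwars, height = height / 100)
select(sw2, name, height)
```

As with the other verbs, the change is only kept if the result is saved with assignment.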
Select Live Coding
# SETUP: Load package and inspect example tibble
library(tidyverse) # includes the dplyr package
starwars
# ==============================================================================
# USECASE: Retain only the specified variables
sw <- select(starwars, name)
sw
sw <- select(starwars, name, sex, species)
sw
# ==============================================================================
# PITFALL: Don't forget to save the change with assignment
select(starwars, name, sex, species)
starwars # still includes all variables
# ==============================================================================
# USECASE: Retain all variables between two variables
sw <- select(starwars, name, hair_color:eye_color)
sw
# ==============================================================================
# USECASE: Retain all variables except the specified ones
sw <- select(starwars, -sex, -species)
sw
sw <- select(starwars, -c(sex, species))
sw
sw <- select(starwars, -c(hair_color:starships))
sw
# USECASE: Change the name of one or more variables
starwars
sw <- rename(starwars, Character = name)
sw
sw <- rename(starwars, height_cm = height, mass_kg = mass)
sw
# ==============================================================================
# PITFALL: Don't swap the order and try old_name = new_name
sw <- rename(starwars, name = Character) # error
# ==============================================================================
# USECASE: Move variables before or after another variable
starwars
sw <- relocate(starwars, species, sex, .before = height)
sw
sw <- relocate(starwars, species, sex, .after = name)
sw
# ==============================================================================
# PITFALL: Don't forget the period!
sw <- relocate(starwars, sex, before = height)
sw # height was accidentally renamed to before
Row-focused verbs
Arrange sorts rows based on their values
arrange(TBL, VAR_SORT_UP)
arrange(TBL, desc(VAR_SORT_DOWN))
arrange(TBL, VAR_SORT_1ST, VAR_SORT_2ND)
Filter retains certain rows based on criteria
filter(TBL, DBL_CRIT > 0)
filter(TBL, STR_CRIT == "A")
filter(TBL, CRIT1, CRIT2)
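The last template, `filter(TBL, CRIT1, CRIT2)`, keeps only rows that satisfy every criterion. A small sketch on a made-up tibble (the `pets` data is hypothetical) suggests that separating criteria with commas behaves like joining them with `&`, and that rows where a criterion evaluates to NA are dropped.

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration only
pets <- tibble(
  name   = c("Rex", "Tom", "Ivy", "Sam"),
  kind   = c("dog", "cat", "dog", "dog"),
  weight = c(30, 4, NA, 8)
)

a <- filter(pets, kind == "dog", weight > 5)   # comma-separated criteria
b <- filter(pets, kind == "dog" & weight > 5)  # same result with &
a
```

Ivy is a dog but is not retained, because her missing weight makes the second criterion NA rather than TRUE.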
Arrange Live Coding
# USECASE: Sort observations by a variable
starwars
sw <- arrange(starwars, height)
sw # sorted by height, ascending
sw <- arrange(starwars, name)
sw # sorted by name, alphabetically
# ==============================================================================
# USECASE: Sort observations by a variable, in reverse order
sw <- arrange(starwars, desc(height))
sw # sorted by height, descending
sw <- arrange(starwars, desc(name))
sw # sorted by name, reverse-alphabetically
# ==============================================================================
# USECASE: Sort observations by multiple variables
sw <- arrange(starwars, hair_color, mass)
sw # sorted by hair_color, then ties broken by mass
Filter Live Coding
# USECASE: Retain only observations that meet a criterion
sw <- filter(starwars, mass > 100)
sw # only observations with mass greater than 100
sw <- filter(starwars, mass <= 100)
sw # only observations with mass less than or equal to 100
sw <- filter(starwars, species == "Human")
sw # only observations with species equal to Human
sw <- filter(starwars, species != "Human")
sw # only observations with species not equal to Human
# ==============================================================================
# PITFALL: Don't try to use a single = for testing equality
sw <- filter(starwars, height = 150) # error
sw <- filter(starwars, height == 150) # correct
sw
# ==============================================================================
# PITFALL: Don't forget that R is case-sensitive
sw <- filter(starwars, species == "human")
sw # no observations left (because it should be Human)
# ==============================================================================
# USECASE: Retain only observations that meet complex criteria
sw <- filter(starwars, mass > 100 & height > 200)
sw # only observations with mass over 100 AND height over 200
sw <- filter(starwars, height < 100 | hair_color == "none")
sw # only observations with height under 100 OR hair_color equal to none
# ==============================================================================
# PITFALL: Don't forget to complete both conditions
sw <- filter(starwars, mass > 100 & < 200) # error
sw <- filter(starwars, mass > 100 & mass < 200) # correct
sw
# ==============================================================================
# PITFALL: Don't try to equate a string to a vector
sw <- filter(starwars, species == c("Human", "Droid")) # wrong (recycles the vector row by row)
sw <- filter(starwars, species %in% c("Human", "Droid")) # correct
sw
Filter Cheatsheet
Symbol  Description            Num  Chr
<       Less than              Yes  No
<=      Less than or equal to  Yes  No
>       More than              Yes  No
>=      More than or equal to  Yes  No
==      Equal to               Yes  Yes
!=      Not equal to           Yes  Yes
%in%    Found in               Yes  Yes
&       Logical And            Yes  Yes
|       Logical Or             Yes  Yes
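Each symbol in the cheatsheet produces a vector of TRUE/FALSE values, and `filter()` keeps the rows marked TRUE. A minimal base R sketch with made-up vectors (`height` and `species` here are plain vectors invented for illustration, not columns of a real dataset):

```r
height  <- c(95, 150, 210)
species <- c("Droid", "Human", "Wookiee")

height >= 150                      # one TRUE/FALSE per element
species != "Human"
species %in% c("Human", "Droid")   # TRUE when the element is found in the set
height < 100 | species == "Human"  # OR combines two logical vectors
```

Running a criterion on its own like this is a quick way to check what `filter()` will keep.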
Wrangle III
Pipes & Pipelines
How can we do multiple operations to an object?
x <- 10
x2 <- sqrt(x)
x3 <- round(x2)
This works but is cumbersome and error-prone
A better approach is to use pipes and pipelines
x3 <- 10 |> sqrt() |> round()
I like to read |> as “and then…”
“Take 10 and then sqrt() and then round()”
Pipes Live Coding
# SETUP: Enable the pipe operator shortcut
# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"
# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)
# ==============================================================================
# LESSON: The pipe pushes objects to a function as its first argument
# TEMPLATE: x |> function_name() is the same as function_name(x)
x <- 10
y <- sqrt(x)
y
y <- x |> sqrt()
y
# ==============================================================================
# PITFALL: Don't forget to remove the object from the function call
x |> sqrt(x) # wrong
x |> sqrt() # correct
# ==============================================================================
# USECASE: You can still use arguments when piping
z <- round(3.14, digits = 1)
z
z <- 3.14 |> round(digits = 1)
z
# ==============================================================================
# USECASE: Pipes are useful with tibbles and wrangling verbs
starwars
sw <- select(starwars, name, species, height)
sw
sw <- starwars |> select(name, species, height)
sw
# ==============================================================================
# PITFALL: Don't add a pipe without a step after it
sw <- starwars |>
  select(name, species, height) |> # error
Pipelines Live Coding
# USECASE: You can chain multiple pipes together to make a pipeline
x <- 10 |> sqrt() |> round()
x
# ==============================================================================
# TIP: If you want to see the output of a pipeline, you can pipe to print()
x <- 10 |> sqrt() |> round() |> print()
# ==============================================================================
# TIP: To make your pipelines more readable, move each step to a new line
x <- 10 |>
  sqrt() |>
  round() |>
  print()
# ==============================================================================
# PITFALL: Don't put the pipe at the beginning of a line, though
x <- 10
  |> sqrt()
  |> round()
  |> print() # error
# ==============================================================================
# USECASE: Chain together a series of verbs to flexibly wrangle data
tallones <- starwars |>
  select(name, species, height) |>
  rename(height_cm = height) |>
  mutate(height_ft = height_cm / 30.48) |>
  filter(height_ft > 7) |>
  arrange(desc(height_ft)) |>
  print()
Factors
Factors are used to represent categorical data
Factors have multiple possible levels
Levels are discrete and mutually-exclusive
Sometimes categories are unordered (nominal)
Action or Comedy or Drama
Asia or Europe or North America
Sometimes categories are ordered (ordinal)
Mild < Medium < Hot
XS < S < M < L < XL
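What ordering adds can be sketched briefly in base R: an ordered factor supports comparisons between levels. The `sizes` vector below is made up for illustration.

```r
sizes <- c("M", "XS", "L", "S", "M")
sizes_ord <- factor(sizes,
                    levels = c("XS", "S", "M", "L", "XL"),
                    ordered = TRUE)

levels(sizes_ord)  # all possible levels, including the unobserved XL
sizes_ord < "M"    # comparisons follow the level order, not the alphabet
max(sizes_ord)     # the largest observed level
```

Without `ordered = TRUE`, the same comparison would return NA with a warning, since nominal levels have no defined order.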
Factors Live Coding
# USECASE: Ask 10 kids to order 1: nuggets, 2: pizza, or 3: salad
food <- c(2, 2, 1, 2, 1, 2, 1, 1, 2, 2)
food
# ==============================================================================
# LESSON: We can turn this vector into a factor with factor()
food2 <- factor(food, levels = c(1, 2, 3))
food2
food3 <- factor(
  food,
  levels = c(1, 2, 3),
  labels = c("nuggets", "pizza", "salad")
)
food3
# ==============================================================================
# USECASE: We can also quickly and easily count each level with table()
table(food3)
# ==============================================================================
# PITFALL: Don't confuse levels and labels
food4 <- factor(
  food,
  labels = c(1, 2, 3),
  levels = c("nuggets", "pizza", "salad")
)
food4 # full of <NA> because it can't find these levels
# ==============================================================================
# USECASE: You can also just enter strings directly (as self-labels)
genre <- c("pop", "metal", "pop", "rock", "rap", "rap", "pop", "rock")
genre
genre2 <- factor(genre) # observed levels will be assigned alphabetically
genre2
table(genre2)
# ==============================================================================
# LESSON: If ordinal, enter levels low-to-high and add ordered = TRUE
salsa <- c("hot", "mild", "medium", "mild", "medium", "medium")
salsa2 <- factor(salsa, levels = c("mild", "medium", "hot"), ordered = TRUE)
salsa2
# NOTE: We may want to visualize or model ordinal factors differently
# ==============================================================================
# USECASE: Working with factors in a tibble
cereal <- read_csv("cereal.csv")
cereal
cereal2 <- mutate(cereal, mfr = factor(mfr), type = factor(type))
cereal2
table(cereal2$mfr)
table(cereal2$type)
Missing Values
Sometimes your data will have missing values
Perhaps these were never collected
Perhaps the values were lost/corrupted
Perhaps the participant didn’t respond
We need to tell R which values are missing
To do so, we set those values to NA
Functions from tidyverse make this easy
Missingness is often “contagious” in R (e.g., a vector with NA has an unknown mean)
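Before deciding how to handle missing values, it helps to locate and count them. A minimal base R sketch using `is.na()`; the `heights` vector is made up for illustration.

```r
heights <- c(149, NA, 158, NA, 171)

is.na(heights)                        # TRUE wherever a value is missing
n_missing <- sum(is.na(heights))      # TRUE counts as 1, so this counts the NAs
n_missing

heights == NA                         # PITFALL: comparing to NA just yields NA
observed <- heights[!is.na(heights)]  # keep only the non-missing values
mean(observed)
```

The `heights == NA` line is another instance of contagiousness: asking whether an unknown value equals an unknown value has an unknown answer.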
Missing Values Live Coding
# SETUP: We will need tidyverse for the read and mutate functions
library(tidyverse)
# ==============================================================================
# PITFALL: Number codes for missingness will mess up calculations in R
heights <- c(149, 158, -999) # here we use -999 to represent a missing value
range(heights)
mean(heights)
log(heights) # our missing value is no longer -999
# ==============================================================================
# USECASE: Use NA for missingness instead
heights2 <- c(149, 158, NA)
heights2
log(heights2) # the NA stayed an NA (due to contagiousness)
# ==============================================================================
# LESSON: Use na.rm = TRUE to do a summary function ignoring the NAs
mean(heights2) # the mean is an NA (due to contagiousness)
mean(heights2, na.rm = TRUE)
range(heights2, na.rm = TRUE)
# ==============================================================================
# USECASE: Dealing with missing values in tibbles
cereal <- read_csv("cereal.csv")
cereal$rating
range(cereal$rating)
# ==============================================================================
# LESSON: Use na_if() to convert specific values to NA while mutating
cereal2 <- mutate(cereal, rating = na_if(rating, -999))
cereal2$rating
range(cereal2$rating, na.rm = TRUE)
# ==============================================================================
# LESSON: Use read_csv(na) to convert specific values to NA while reading
cereal3 <- read_csv("cereal.csv", na = "-999")
cereal3$rating
range(cereal3$rating, na.rm = TRUE)
Wrangle IV
Summarize
Although we store data about many observations…
…we often want to summarize across observations
This is like folding the tibble down to one row
We’ve seen functions that summarize vectors
length(), sum(), min(), max()
mean(), median(), sd(), var()
summarize() lets us use them on tibbles
It works very similarly to mutate()
It always creates a tibble as output
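The contrast with `mutate()` can be sketched on a tiny made-up tibble (`scores` is hypothetical): `mutate()` keeps every row and adds a column, whereas `summarize()` folds the rows down to one.

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration only
scores <- tibble(student = c("A", "B", "C"), points = c(8, 9, 10))

kept   <- mutate(scores, avg_points = mean(points))     # 3 rows, 3 columns
folded <- summarize(scores, avg_points = mean(points))  # 1 row, 1 column
kept
folded
```

Both calls use the same `name = expression` syntax; only the shape of the output differs.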
Summarize Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
) |>
  print()
# ==============================================================================
# USECASE: Summarize the typical sales
my_summary <- sales |>
  summarize(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()
# ==============================================================================
# PITFALL: Don't use summary() instead of summarize()
my_summary <- sales |>
  summary(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print() # not a tibble
# ==============================================================================
# USECASE: Use more than one summary function
my_summary <- sales |>
  summarize(
    total_items = sum(items),
    total_spent = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |>
  print()
# ==============================================================================
# USECASE: Use counting functions
my_counts <- sales |>
  summarize(
    n_sales = n(),
    n_customers = n_distinct(customer),
    n_stores = n_distinct(store)
  ) |>
  print()
Group Summarize
We can also summarize a tibble by group
This is like folding the tibble multiple times
Specifically, we fold down to one row per group
The syntax for summarize is identical
The only difference is in the tibble we give it
We first pass it through group_by()
Pipelines make this very easy
Group Summarize Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
) |>
  print()
# ==============================================================================
# LESSON: We pass a tibble through group_by to group it
sales
sales |> group_by(store) # note the display says "grouped"
# ==============================================================================
# USECASE: We can then summarize and get stats per group
sales |>
  group_by(store) |>
  summarize(
    customers = n_distinct(customer),
    items_sold = sum(items),
    total_sales = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  )
# ==============================================================================
# SETUP: Let's get a larger, more realistic dataset
# Extras pane > Packages tab > Install > nycflights13
library("nycflights13")
flights
# ==============================================================================
# USECASE: Find the carrier with the lowest average delays
flights |>
  group_by(carrier) |>
  summarize(m_delay = mean(dep_delay, na.rm = TRUE)) |>
  arrange(m_delay)
# ==============================================================================
# LESSON: We can also group by multiple variables
# USECASE: Let's find the day of the year with the most flights
flights |>
  group_by(month, day) |>
  summarize(n_flights = n()) |>
  arrange(desc(n_flights))
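When grouping by more than one variable, `summarize()` by default drops only the last grouping variable, so the result is still grouped by the first. A minimal sketch of the `.groups` argument, which makes the choice explicit; it uses a reduced version of the `sales` example with an added, hypothetical `month` column.

```r
library(dplyr)
library(tibble)

sales <- tibble(
  store = c("A", "A", "A", "B", "B"),
  month = c(1, 1, 2, 1, 2),  # hypothetical column added for this sketch
  spent = c(685, 590, 392, 185, 123)
)

by_store_month <- sales |>
  group_by(store, month) |>
  summarize(total_spent = sum(spent), .groups = "drop")  # return an ungrouped tibble
by_store_month
```

Dropping the groups avoids surprises when later verbs would otherwise keep operating within each store.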
Visualize I
What is a graphic?
A data visualization expresses data through visual aesthetics.
Describing Graphics
Some simple graphics are easy to describe and may even have ready names.
Describing Graphics
A grammar of graphics will help us describe more complex graphics.
The Grammar of Graphics
The grammar of graphics is a set of rules for describing and creating data visualizations
To make our data visual (and therefore put our highly evolved occipital lobes to work)…
We connect variables to visual qualities
We represent observations as visual objects
This requires some fundamental elements
We will first learn about them in lecture
We will then apply them in R using {ggplot2}
Data
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
Graphics require data (e.g., tibbles), which describe observations using variables.
Aesthetic Mappings
Graphics require aesthetic mappings, which connect data variables to visual qualities.
Scales
Graphics require scales, which connect specific data values to specific aesthetic values.
Geometric Objects
Graphics require geometric objects (geoms), which represent the observations.
ggplot2 Basics
The ggplot2 package is a part of tidyverse
No need to install or load it separately
It plays nicely with tibbles and wrangling
It implements the grammar of graphics in R
The “gg” stands for “grammar of graphics”
Thus, we will need to provide all four elements
We will create a pseudo-pipeline of commands
However, we will use + rather than |>
This is because {ggplot2} predates the R pipe
ggplot2 Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg
# ==============================================================================
# LESSON: First, set the data to a tibble
p <- ggplot(data = mpg)
p
# ==============================================================================
# LESSON: Next, set the aesthetic mappings with aes()
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p
# ==============================================================================
# TIP: You can leave off the optional argument names
p <- ggplot(mpg, aes(x = displ, y = hwy))
p
# ==============================================================================
# LESSON: Next, set the positional scales
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)",
    limits = c(1, 7),
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p
# ==============================================================================
# LESSON: Finally, add a point geom
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)",
    limits = c(1, 7),
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()
p
# ==============================================================================
# TIP: If you leave off the scales, R will try to guess
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p
# ==============================================================================
# LESSON: We can also customize the geom with arguments
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red", shape = "square", size = 2)
p
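Because `ggplot()` takes the data as its first argument, a wrangling pipeline can feed straight into a plot. A minimal sketch, assuming dplyr and ggplot2 are available (both come with tidyverse); note the switch from `|>` to `+` once `ggplot()` is called.

```r
library(dplyr)
library(ggplot2)

p <- mpg |>
  filter(class == "compact") |>      # wrangling steps are joined with the pipe
  ggplot(aes(x = displ, y = hwy)) +  # plotting steps are joined with +
  geom_point()
p
```

Using `|>` after `ggplot()` by mistake is a common slip; ggplot2 reports an error when a layer is piped instead of added.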
Basic Layering
ggplot2 uses a layered grammar of graphics
We can keep stacking geoms on top
Layering adds a lot of possibilities
We can convey more complex ideas
We can learn more about our data
But we can still describe these graphics
Just describe each layer in turn
And describe the layers’ ordering
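One way to make the ordering concrete is to inspect the layers stored in a plot object. This is a sketch that assumes the `economics` dataset from ggplot2; `layer_order()` is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)

# Build the same two layers in both orders
points_then_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point() +
  geom_line(color = "orange")

line_then_points <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange") +
  geom_point()

# Hypothetical helper: list the geoms of a plot in drawing order.
# Later layers are drawn on top of earlier ones.
layer_order <- function(plot) {
  vapply(
    plot$layers,
    function(layer) class(layer$geom)[1],
    character(1),
    USE.NAMES = FALSE
  )
}

layer_order(points_then_line)
layer_order(line_then_points)
```

In the first plot the orange line covers the points; in the second the points sit on top of the line.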
Basic Layering Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg
# ==============================================================================
# USECASE: Add a smooth geom (i.e., line of best fit)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
# ==============================================================================
# USECASE: Add a line geom (i.e., connecting points)
economics
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point()
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point() +
  geom_line(color = "orange", size = 1)
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange", size = 1) +
  geom_point()
# ==============================================================================
# USECASE: Add reference line geoms
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_hline(yintercept = 0, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_vline(xintercept = 7.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()
Working with Color
Color scales come in two main types:
Discrete scales have separate colors
Best with factor variables
Continuous scales form a gradient
Best with numeric variables
There are two ways to control color:
You can map color to a variable
It will take on different values
You can set color to a value
It will take on one value only
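The difference between mapping and setting can be checked by counting the colors a plot actually draws. This is a small sketch assuming the `mpg` dataset; `layer_data()` is a ggplot2 function that returns the computed values for one layer, and the `n_colors_*` names are illustrative.

```r
library(ggplot2)

# Mapping: color goes inside aes() and varies with the drv variable
p_map <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Setting: color goes outside aes() and is the same for every point
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# Count the distinct colors each plot draws in its first layer
n_colors_mapped <- length(unique(layer_data(p_map)$colour))
n_colors_set <- length(unique(layer_data(p_set)$colour))

n_colors_mapped
n_colors_set
```

The mapped plot uses one color per level of `drv`, while the set plot uses a single color for every point.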
Color Live Coding
# SETUP: We will need tidyverse and an example dataset
library(tidyverse)
mpg
# ==============================================================================
# USECASE: Continuous color scales work well with numeric variables
ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)
ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")
# ==============================================================================
# USECASE: Use a discrete color scale with categorical variables
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain",
    breaks = c("4", "f", "r"),
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )
# ==============================================================================
# PITFALL: Don't forget to set categorical variables as factors
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point() # R guesses you want a continuous scale
ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  scale_color_discrete(name = "Cylinders")
# ==============================================================================
# LESSON: Set a geom's color aesthetic to make it always that color
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")
# ==============================================================================
# PITFALL: However, do this inside of geom() not aes()
ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point() # unintended
# ==============================================================================
# LESSON: If you both set and map color, the setting will win
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = "blue")
# SETUP: We will need tidyverse and an example graphic
library(tidyverse)
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Fuel Efficiency")
p
# ==============================================================================
# USECASE: Apply a "complete" theme
p + theme_bw()
p + theme_classic()
p + theme_dark()
# ==============================================================================
# LESSON: For more precise control, we can use theme()
p + theme(legend.position = "top")
p + theme(plot.title = element_text(color = "purple", face = "bold"))
p + theme(panel.grid = element_blank())
# NOTE: There are a lot of elements to learn, so use a cheatsheet!
Exporting Graphics
We may need to export graphics from R
e.g., for a paper, poster, or presentation
This job is handled fantastically by ggsave()
We can create many types of files
We can customize the exact size
I recommend .png for most daily purposes
For publishing, I prefer .pdf or .svg
They retain perfect quality at any zoom
You can send these files to most publishers
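A short sketch of the raster versus vector distinction, assuming the default graphics devices for `.png` and `.pdf` are available on your machine. The file names and the use of `tempdir()` are choices made for this example only; `dpi` is a `ggsave()` argument that matters for raster formats such as `.png`.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Write to a temporary folder so the sketch leaves no files in the project
out_dir <- tempdir()
png_file <- file.path(out_dir, "example.png")
pdf_file <- file.path(out_dir, "example.pdf")

# Raster output: dpi controls how many pixels each inch receives
ggsave(filename = png_file, plot = p, width = 6, height = 3, units = "in", dpi = 300)

# Vector output: the file type is inferred from the extension
ggsave(filename = pdf_file, plot = p, width = 6, height = 3, units = "in")

# Confirm that both files were written
file.exists(c(png_file, pdf_file))
```

A `.png` has a fixed number of pixels, so it blurs when enlarged; a `.pdf` stores shapes and text, which is why it stays sharp at any zoom.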
Exporting Live Coding
# SETUP: We will need tidyverse and an example graphic
library(tidyverse)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p
# ==============================================================================
# USECASE: Save a specific ggplot object to a file
ggsave(filename = "pfinal.png", plot = p)
# ==============================================================================
# LESSON: Specify the size of the file to create
ggsave(filename = "pfinal2.png", plot = p, width = 6, height = 3, units = "in")
# ==============================================================================
# LESSON: Just change the extension to create a different file type
ggsave(filename = "pfinal2.pdf", plot = p, width = 6, height = 3, units = "in")
# ==============================================================================
# PITFALL: Creating a very large file may lead to small text
ggsave(filename = "p_poster.png", plot = p, width = 12, height = 8, units = "in")
# ==============================================================================
# TIP: You can quickly increase the text size using base_size
p2 <- p + theme_grey(base_size = 24)
ggsave(
  filename = "p_poster2.png", plot = p2,
  width = 12, height = 8, units = "in"
)